Title: METHOD FOR ROBUST CLASSIFICATION IN SPEECH CODING

Inventor: Jes Thyssen 

Field of Invention 

The present invention relates generally to a method for improved speech
classification and, more particularly, to a method for robust speech classification in
speech coding.

Background of the Invention 

With respect to speech communication, background noise can include passing
motorists, overhead aircraft, babble noise such as restaurant/cafe type noises, music,
and many other audible noises. Cellular telephone technology brings the ease of
communicating anywhere a wireless signal can be received and transmitted. However,
the downside of the so-called "cellular age" is that phone conversations may no
longer be private or may take place where communication is barely feasible. For example, if a
cell phone rings and the user answers it, speech communication is effectuated whether
the user is in a quiet park or near a noisy jackhammer. Thus, the effects of background
noise are a major concern for cellular phone users and providers.

Classification is an important tool in speech processing. Typically, the speech
signal is classified into a number of different classes in order to, among other reasons, place
emphasis on perceptually important features of the signal during encoding. When the
speech is clean or free from background noise, robust classification (i.e., low probability
of misclassifying frames of speech) is more readily realized. However, as the level of
background noise increases, efficiently and accurately classifying the speech becomes
a problem.

In the telecommunication industry, speech is digitized and compressed per ITU
(International Telecommunication Union) standards, or other standards such as wireless
GSM (global system for mobile communications). There are many standards depending
upon the amount of compression and application needs. It is advantageous to highly
compress the signal prior to transmission because as the compression increases, the bit
rate decreases. This allows more information to transfer in the same amount of bandwidth,
thereby saving bandwidth, power and memory. However, as the bit rate decreases, a
faithful reproduction of the speech becomes increasingly difficult. For example, for
telephone applications (speech signal with a frequency bandwidth of around 3.3 kHz) the digital
speech signal is typically 16 bits linear or 128 kbits/s. ITU-T standard G.711 operates at
64 kbits/s or half of the linear PCM (pulse code modulation) digital speech signal. The
standards continue to decrease in bit rate as demands for bandwidth rise (e.g., G.726 is 32
kbits/s; G.728 is 16 kbits/s; G.729 is 8 kbits/s). A standard is currently under development
that will decrease the bit rate even lower to 4 kbits/s.
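The bit rates quoted above follow from the sampling rate times the number of bits per sample. The following is a minimal sketch of that arithmetic, assuming the usual 8 kHz narrowband telephony sampling rate (the text does not state it) and the 160-sample frame used later in this description; the function names are illustrative only.

```python
# Assumption: narrowband telephony sampled at 8 kHz (not stated in the text).
SAMPLE_RATE_HZ = 8000


def pcm_bit_rate(bits_per_sample, sample_rate_hz=SAMPLE_RATE_HZ):
    """Bit rate in bits/s of uncompressed PCM."""
    return bits_per_sample * sample_rate_hz


def bits_per_frame(bit_rate, frame_len=160, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of bits available to code one frame of frame_len samples."""
    return bit_rate * frame_len / sample_rate_hz


linear_rate = pcm_bit_rate(16)  # 16-bit linear PCM
g711_rate = pcm_bit_rate(8)     # G.711 codes each sample with 8 bits
```

Under these assumptions a 4 kbits/s coder has only 80 bits for each 160-sample frame, which illustrates why each frame must be coded according to its perceptually dominant features.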

Typically, speech is classified based on a set of parameters, and for those
parameters, a threshold level is set for determining the appropriate class. When
background noise is in the environment (e.g., additive speech and noise at the same
time), the parameters derived for classification are typically distorted, because the
noise contribution overlays or adds to that of the speech.
Present solutions include estimating the level of background noise in a given
environment and, depending on that level, varying the thresholds. One problem with
these techniques is that the control of the thresholds adds another dimension to the
classifier. This increases the complexity of adjusting the thresholds, and finding an
optimal setting for all noise levels is not generally practical.



Conexant Docket 99RSS219 
Attorney Docket 50944.8500 



For instance, a commonly derived parameter is pitch correlation, which relates to 
how periodic the speech is. Even in highly voiced speech, such as the vowel sound "a", 
when background noise is present, the periodicity appears to be much less due to the 
random character of the noise. 

Complex algorithms are known in the art which purport to estimate parameters 
based on a reduced noise signal. In one such algorithm, for example, a complete noise 
compression algorithm is run on a noise-contaminated signal. The parameters are then 
estimated on the reduced noise signal. However, these algorithms are very complex 
and consume power and memory from the digital signal processor (DSP). 

Accordingly, there is a need for a less complex method for speech classification 
which is useful at low bit rates. In particular, there is a need for an improved method for 
speech classification whereby the parameters are not influenced by the background 
noise. 

Summary of the Invention 

The present invention overcomes the problems outlined above and provides a 
method for improved speech communication. In particular, the present invention 
provides a less complex method for improved speech classification in the presence of 
background noise. More particularly, the present invention provides a robust method for 
improved speech classification in speech coding whereby the effects of the background 
noise on the parameters are reduced. 

In accordance with one aspect of the present invention, a homogeneous set of 
parameters, independent of the background noise level, is obtained by estimating the 
parameters of the clean speech.

Brief Description of the Drawings

These and other features, aspects and advantages of the present invention will
become better understood with reference to the following description, appended claims,
and accompanying drawings where:

Figure 1 illustrates, in block format, a simplified depiction of the typical stages of
speech processing in the prior art;

Figure 2 illustrates, in block detail, an exemplary encoding system in accordance
with the present invention;

Figure 3 illustrates, in block detail, an exemplary decision logic of Figure 2; and

Figure 4 is a flow chart of an exemplary method in accordance with the present
invention.

Detailed Description of Preferred Embodiments

The present invention relates to an improved method for speech classification in
the presence of background noise. Although the methods for speech communication
and, in particular, the methods for classification presently disclosed are particularly
suited for cellular telephone communication, the invention is not so limited. For
example, the method for classification of the present invention may be well suited for a
variety of speech communication contexts such as the PSTN (public switched telephone
network), wireless, voice over IP (internet protocol), and the like.

Unlike the prior art methods, the present invention discloses a method which
represents the perceptually important features of the input signal and performs
perceptual matching rather than waveform matching. It should be understood that the
present invention represents a method for speech classification which may be one part
of a larger speech coding algorithm. Algorithms for speech coding are widely known in
the industry. One skilled in the art will recognize that
various processing steps may be performed both prior to and after the implementation
of the present invention (e.g., the speech signal may be pre-processed prior to the
actual speech encoding; common frame based processing; mode dependent
processing; and decoding).

By way of introduction, Figure 1 broadly illustrates, in block format, the typical
stages of speech processing known in the prior art. In general, the speech system 100
includes an encoder 102, transmission or storage 104 of the bit stream, and a decoder
106. Encoder 102 plays a critical role in the system, especially at very low bit rates.
The pre-transmission processes are carried out in encoder 102, such as distinguishing
speech from non-speech, deriving the parameters, setting the thresholds, and
classifying the speech frame. Typically, for high quality speech communication, it is
important that the encoder (usually through an algorithm) consider the kind of signal and,
based upon the kind, process the signal accordingly. The specific functions of the
encoder of the present invention will be discussed in detail below; in general, however,
the encoder classifies the speech frame into any number of classes. The information
contained in the class will help to further process the speech.

The encoder compresses the signal, and the resulting bit stream is transmitted
104 to the receiving end. Transmission (wireless or wireline) is the carrying of the bit
stream from the sending encoder 102 to the receiving decoder 106. Alternatively, the bit
stream may be temporarily stored for delayed reproduction or playback in a device such as
an answering machine or voiced email, prior to decoding.

The bit stream is decoded in decoder 106 to retrieve a sample of the original speech
signal. Typically, it is not realizable to retrieve a speech signal that is identical to the
original signal, but with enhanced features (such as those provided by the present
invention), a close sample is obtainable. To some degree, decoder 106 may be considered
the inverse of encoder 102. In general, many of the functions performed by encoder 102
can also be performed in decoder 106 but in reverse.

Although not illustrated, it should be understood that speech system 100 may further 
include a microphone to receive a speech signal in real time. The microphone delivers the 
speech signal to an A/D (analog to digital) converter where the speech is converted to a 
digital form then delivered to encoder 102. Additionally, decoder 106 delivers the digitized 
signal to a D/A (digital to analog) converter where the speech is converted back to analog 
form and sent to a speaker. 

Like the prior art, the present invention includes an encoder or similar device 
which includes an algorithm based on a CELP (Code Excited Linear Prediction) model. 
However, in order to achieve toll quality at low bit rates (e.g., 4 kbits/s) the algorithm 
departs somewhat from the strict waveform-matching criterion of known CELP 
algorithms and strives to catch the perceptually important features of the input signal. 
While the present invention may be but one single part of an eX-CELP (extended 
CELP) algorithm, it is helpful to broadly introduce the overall functions of the algorithm. 

The input signal is analyzed according to certain features, such as, for example, 
degree of noise-like content, degree of spike-like content, degree of voiced content, 
degree of unvoiced content, evolution of magnitude spectrum, evolution of energy 
contour, and evolution of periodicity. This information is used to control weighting 
during the encoding/quantization process. The general philosophy of the present
method may be characterized as accurately representing the perceptually important
features by performing perceptual matching rather than waveform matching. This is
based, in part, on the assumption that at low bit rates waveform matching is not
sufficiently accurate to faithfully capture all information in the input signal. The
algorithm, including the portion embodying the present invention, may be implemented in C code or
any other suitable computer or device language known in the industry, such as
assembly. While the present invention is conveniently described with respect to the
eX-CELP algorithm, it should be appreciated that the method for improved speech
classification herein disclosed may be but one part of an algorithm and may be used in
similar known or yet to be discovered algorithms.

In one embodiment, a voice activity detector (VAD) is embedded in the encoder
in order to provide information on the characteristic of the input signal. The VAD
information is used to control several aspects of the encoder, including estimation of the
signal to noise ratio (SNR), pitch estimation, some classification, spectral smoothing,
energy smoothing, and gain normalization. In general, the VAD distinguishes between
speech and non-speech input. Non-speech may include background noise, music,
silence, or the like. Based on this information, some of the parameters can be
estimated.

Referring now to Figure 2, an encoder 202 illustrates, in block format, the
classifier 204 in accordance with one embodiment of the present invention. Classifier
204 suitably includes a parameter-deriving module 206 and a decision logic 208.
Classification can be used to emphasize the perceptually important features during
encoding. For example, classification can be used to apply different weight to a signal
frame. Classification does not necessarily affect the bandwidth, but it does provide
information to improve the quality of the reconstructed signal at the decoder (receiving
end). However, in certain embodiments it does affect the bandwidth (bit rate), because
the bit rate, and not just the encoding process, is varied according to the class information.
If the frame is background noise, then it may be classified as such and it may
be desirable to maintain the randomness characteristic of the signal. However, if the
frame is voiced speech, then it may be important to keep the periodicity of the signal.
Classifying the speech frame provides the remaining part of the encoder with
information to enable emphasis to be placed on the important features of the signal (i.e.,
"weighting").

Classification is based on a set of derived parameters. In the present 
embodiment, classifier 204 includes a parameter-deriving module 206. Once the set of 
parameters is derived for a particular frame of speech, the parameters are measured 
either alone or in combination with other parameters by decision logic 208. The details 
of decision logic 208 will be discussed below, however, in general, decision logic 208 
compares the parameters to a set of thresholds. 

By way of example, a cellular phone user may be communicating in a particularly 
noisy environment. As the level of background noise increases, the derived parameters 
may change. The present invention proposes a method which, on the parameter level, 
removes the contribution due to the background noise, thereby generating a set of 
parameters that are invariant to the level of background noise. In other words, one 
embodiment of the present invention includes deriving a set of homogeneous 
parameters instead of having parameters that vary with the level of background noise.
This is particularly important when distinguishing between different kinds of speech, e.g.,
voiced speech, unvoiced speech, and onset, in the presence of background noise. To
accomplish this, parameters for the noise-contaminated signal are still estimated, but
based on those parameters and information about the background noise, the component
due to the noise contribution is removed. An estimate of the parameters of the clean
signal (without noise) is thereby obtained.

With continued reference to Figure 2, the digital speech signal is received in
encoder 202 for processing. There may be occasions when other modules within
encoder 210 can suitably derive some of the parameters, rather than classifier 204
re-deriving the parameters. In particular, a pre-processed speech signal (e.g., this may
include silence enhancement, high-pass filtering, and background noise attenuation),
the pitch lag and correlation of the frame, and the VAD information may be used as
input parameters to classifier 204. Alternatively, the digitized speech signal or a
combination of both the signal and other module parameters is input to classifier 204.
Based on these input parameters and/or speech signals, parameter-deriving module
206 derives a set of parameters which will be used for classifying the frame.

In one embodiment, parameter-deriving module 206 includes a basic
parameter-deriving module 212, a noise component estimating module 214, a noise component
removing module 216, and an optional parameter-deriving module 218. In one aspect of
the present embodiment, basic parameter-deriving module 212 derives three
parameters, spectral tilt, absolute maximum, and pitch correlation, which can form the
basis for the classification. However, it should be recognized that significant processing
and analysis of the parameters may be performed prior to the final decision. These first
few parameters are estimations of the signal having both the speech and noise
component. The following description of parameter-deriving module 206 includes an
example of preferred parameters, but in no way should it be construed as limiting. The
examples of parameters with the accompanying equations are intended for
demonstration and not necessarily as the only parameters and/or mathematical
calculations available. In fact, one skilled in the art will be quite familiar with the
following parameters and/or equations and may be aware of similar or equivalent
substitutions which are intended to fall within the scope of the present invention.



Spectral tilt is an estimation of the first reflection coefficient four times per frame,
given by:

$$\kappa(k) = \frac{\sum_{n=1}^{L-1} s_k(n)\, s_k(n-1)}{\sum_{n=0}^{L-1} s_k(n)^2}, \quad k = 0, 1, \ldots, 3, \tag{1}$$

where $L = 80$ is the window over which the reflection coefficient may be suitably
calculated and $s_k(n)$ is the $k$-th segment given by:

$$s_k(n) = s(k \cdot 40 - 20 + n) \cdot w_h(n), \quad n = 0, 1, \ldots, 79, \tag{2}$$

where $w_h(n)$ is an 80-sample Hamming window known in the industry and $s(0)$,
$s(1), \ldots, s(159)$ is the current frame of the pre-processed speech signal.
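The spectral tilt computation can be sketched as follows. This is only an illustration, assuming the tilt is the normalized lag-one autocorrelation of each Hamming-windowed 80-sample segment, and assuming the caller supplies a buffer holding $s(-20)$ through $s(179)$, since the first and last segments reach 20 samples outside the current frame. The symmetric Hamming definition and the names `hamming` and `spectral_tilt` are assumptions, not taken from the text.

```python
import math

L = 80  # analysis window length in samples


def hamming(n_points):
    """Symmetric Hamming window (one common definition, assumed here)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def spectral_tilt(buf):
    """Four tilt estimates for one frame.

    buf holds s(-20)..s(179): 20 past samples, the 160-sample frame,
    and 20 look-ahead samples, so buf[i + 20] corresponds to s(i).
    """
    if len(buf) != 200:
        raise ValueError("expected 200 samples: s(-20)..s(179)")
    window = hamming(L)
    tilts = []
    for k in range(4):
        # s_k(n) = s(k*40 - 20 + n) * w_h(n); the +20 buffer offset cancels -20
        seg = [buf[k * 40 + n] * window[n] for n in range(L)]
        num = sum(seg[n] * seg[n - 1] for n in range(1, L))
        den = sum(x * x for x in seg)
        tilts.append(num / den if den > 0.0 else 0.0)
    return tilts
```

A slowly varying (low-pass) signal gives a tilt close to +1, while a rapidly alternating (high-pass) signal gives a tilt close to -1.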

Absolute maximum is the tracking of the absolute signal maximum, with eight estimates
per frame, given by:

$$\chi(k) = \max\{\, |s(n)|,\; n = n_s(k), n_s(k)+1, \ldots, n_e(k)-1 \,\}, \quad k = 0, 1, \ldots, 7, \tag{3}$$

where $n_s(k)$ and $n_e(k)$ are the starting point and ending point, respectively, for the
search of the $k$-th maximum at time $k \cdot 160/8$ samples of the frame. In general, the length
of the segment is 1.5 times the pitch period and the segments overlap. In this way, a
smooth contour of the amplitude envelope is obtained.
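A minimal sketch of the absolute maximum tracking is given below. The text specifies only the segment length (1.5 pitch periods) and that there are eight estimates per frame; the placement of each search segment (ending at the end of its 20-sample subframe and reaching back into past samples) is an assumption made for this illustration, as is the function name.

```python
def absolute_maximum(history, frame, pitch_lag):
    """Eight estimates of the absolute signal maximum for a 160-sample frame.

    Assumption: each search segment ends at the end of its 20-sample
    subframe and reaches back 1.5 pitch periods, clipped to the samples
    available in history.
    """
    buf = list(history) + list(frame)
    base = len(history)                    # position of s(0) in buf
    seg_len = int(round(1.5 * pitch_lag))
    maxima = []
    for k in range(8):
        n_e = base + (k + 1) * 20          # exclusive end of the search
        n_s = max(0, n_e - seg_len)        # start of the search
        maxima.append(max(abs(x) for x in buf[n_s:n_e]))
    return maxima
```

Because consecutive segments overlap whenever the segment is longer than 20 samples, a single spike is seen by several consecutive estimates, which is what smooths the amplitude contour.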

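Normalized standard deviation of the pitch lag indicates how stable the pitch period is. For
example, in voiced speech the pitch period is stable, and in unvoiced speech it is
unstable:

$$\sigma_{L_p}(m) = \frac{1}{\mu_{L_p}(m)} \sqrt{\frac{\sum_{i=0}^{2} \left(L_p(m-i) - \mu_{L_p}(m)\right)^2}{3}}, \tag{4}$$

where $L_p(m)$ is the input pitch lag, and $\mu_{L_p}(m)$ is the mean of the pitch lag over the past
three frames, given by:

$$\mu_{L_p}(m) = \frac{1}{3} \sum_{i=0}^{2} L_p(m-i). \tag{5}$$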
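As a small worked sketch, the pitch lag statistic can be computed as below, assuming the normalized standard deviation is the population standard deviation of the last three pitch lags divided by their mean. The function name is hypothetical.

```python
import math


def pitch_lag_stats(lags):
    """Return (mean, normalized standard deviation) of three pitch lags.

    lags holds the pitch lag of the current frame and the two previous
    frames, in any order.
    """
    if len(lags) != 3:
        raise ValueError("expected the pitch lags of three frames")
    mean = sum(lags) / 3.0
    variance = sum((lag - mean) ** 2 for lag in lags) / 3.0
    return mean, math.sqrt(variance) / mean
```

A steady lag such as 40, 40, 40 gives a value of zero, while an erratic lag such as 30, 60, 90 gives roughly 0.41, consistent with the stable versus unstable distinction drawn in the text.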



In one embodiment, noise component estimating module 214 is controlled by the 
VAD. For instance, if the VAD indicates that the frame is non-speech (i.e., background 
noise), then the parameters defined by noise component estimating module 214 are 
updated. However, if the VAD indicates that the frame is speech, then module 214 is 
not updated. The parameters defined by the following exemplary equations are suitably 
estimated/sampled 8 times per frame providing a fine time resolution of the parameter 
space. 

Running mean of the noise energy is an estimation of the energy of the noise,
given by:

$$\langle E_{N,p}(k) \rangle = \alpha_1 \cdot \langle E_{N,p}(k-1) \rangle + (1 - \alpha_1) \cdot E_p(k), \tag{6}$$

where $E_p(k)$ is the normalized energy of the pitch period at time $k \cdot 160/8$ samples of
the frame. It should be noted that the segments over which the energy is calculated
may overlap since the pitch period typically exceeds 20 samples (160 samples/8).

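Running mean of the spectral tilt of the noise, given by:

$$\langle \kappa_N(k) \rangle = \alpha_1 \cdot \langle \kappa_N(k-1) \rangle + (1 - \alpha_1) \cdot \kappa(k \bmod 2). \tag{7}$$

Running mean of the absolute maximum of the noise, given by:

$$\langle \chi_N(k) \rangle = \alpha_1 \cdot \langle \chi_N(k-1) \rangle + (1 - \alpha_1) \cdot \chi(k). \tag{8}$$

Running mean of the pitch correlation of the noise, given by:

$$\langle R_{N,p}(k) \rangle = \alpha_1 \cdot \langle R_{N,p}(k-1) \rangle + (1 - \alpha_1) \cdot R_p, \tag{9}$$

where $R_p$ is the input pitch correlation of the frame. The adaptation constant $\alpha_1$ is
preferably adaptive, though a typical value is $\alpha_1 = 0.99$.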
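The VAD-controlled update of the noise estimates can be sketched as a set of first-order recursive averages that are frozen during speech. The class below is only an illustration: the fixed adaptation constant, the zero initial values, and the names `NoiseEstimator` and `update` are assumptions, and the text notes that the constant is preferably adaptive.

```python
ALPHA_1 = 0.99  # typical adaptation constant quoted in the text


class NoiseEstimator:
    """Running means of the noise parameters, updated only on non-speech."""

    def __init__(self):
        # Assumption: all running means start at zero.
        self.energy = 0.0
        self.tilt = 0.0
        self.abs_max = 0.0
        self.pitch_corr = 0.0

    @staticmethod
    def _smooth(mean, value, alpha=ALPHA_1):
        """One step of the recursion mean <- alpha*mean + (1-alpha)*value."""
        return alpha * mean + (1.0 - alpha) * value

    def update(self, is_speech, energy, tilt, abs_max, pitch_corr):
        """Update the estimates for one of the eight sample points."""
        if is_speech:
            return  # speech frame: keep the noise estimates frozen
        self.energy = self._smooth(self.energy, energy)
        self.tilt = self._smooth(self.tilt, tilt)
        self.abs_max = self._smooth(self.abs_max, abs_max)
        self.pitch_corr = self._smooth(self.pitch_corr, pitch_corr)
```

With a constant of 0.99 each new observation contributes only 1% to the running mean, so the estimates track slow changes in the background noise while ignoring short-lived fluctuations.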

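The background noise to signal ratio may be calculated according to:

$$\gamma(k) = \sqrt{\frac{\langle E_{N,p}(k) \rangle}{E_p(k)}}, \tag{10}$$

where $E_p(k)$ is the normalized energy of the pitch period of the current segment.

Parametric noise attenuation is suitably limited to an acceptable level, e.g., about
30 dB, i.e.,

$$\gamma(k) = \{\, \gamma(k) > 0.968 \;?\; 0.968 : \gamma(k) \,\}. \tag{11}$$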
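The limit of 0.968 and the figure of about 30 dB can be related by a short calculation. The sketch below assumes the ratio is an amplitude-domain quantity (the square root of the noise-to-signal energy ratio) and reads the limit as follows: for a pure-noise input a parameter is scaled by at most $1 - 0.968 = 0.032$, which is about 30 dB of attenuation. Both the square root and this reading are assumptions, as is the guard for zero energy.

```python
import math

GAMMA_MAX = 0.968  # upper limit quoted in the text


def noise_to_signal_ratio(noise_energy_mean, energy):
    """Limited background noise to signal ratio.

    Assumption: amplitude-domain ratio, i.e. the square root of the
    energy ratio; a non-positive energy is treated as pure noise.
    """
    if energy <= 0.0:
        return GAMMA_MAX
    return min(math.sqrt(noise_energy_mean / energy), GAMMA_MAX)


def max_attenuation_db(gamma_max=GAMMA_MAX):
    """Attenuation in dB when the residual factor is 1 - gamma_max."""
    return -20.0 * math.log10(1.0 - gamma_max)
```

For example, a noise energy estimate 20 dB below the current energy (ratio 0.01) gives a value of 0.1, whereas any frame at or below the noise floor is clamped to 0.968.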

Noise component removing module 216 applies weighting to the three basic parameters
according to the following exemplary equations. The weighting removes the background
noise component in the parameters by subtracting the contributions from the
background noise. This provides a noise-free set of parameters (weighted parameters)
that are independent of any background noise, are more uniform, and improve the
robustness of the classification in the presence of background noise.

Weighted spectral tilt is estimated by:

$$\kappa_w(k) = \kappa(k \bmod 2) - \gamma(k) \cdot \langle \kappa_N(k) \rangle. \tag{12}$$

Weighted absolute maximum is estimated by:

$$\chi_w(k) = \chi(k) - \gamma(k) \cdot \langle \chi_N(k) \rangle. \tag{13}$$

Weighted pitch correlation is estimated by:

$$R_{w,p}(k) = R_p - \gamma(k) \cdot \langle R_{N,p}(k) \rangle. \tag{14}$$

The derived parameters may then be compared in decision logic 208. Optionally,
it may be desirable to derive one or more of the following parameters depending upon
the particular application. Optional module 218 includes any number of additional
parameters which may be used to further aid in classifying the frame. Again, the
following parameters and/or equations are merely intended as exemplary and are in no
way intended as limiting.
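The noise removal performed by module 216 amounts to subtracting, from each raw parameter, the running mean of that parameter in noise scaled by the noise to signal ratio. A minimal sketch follows; the dictionary layout and the key names are hypothetical and chosen only for illustration.

```python
def weighted_parameters(raw, noise_mean, gamma):
    """Remove the estimated noise contribution from each raw parameter.

    raw and noise_mean map parameter names to the current estimate and
    to its running mean in noise; gamma is the limited noise to signal
    ratio for the same sample point.
    """
    return {name: raw[name] - gamma * noise_mean[name] for name in raw}


raw = {"tilt": 0.3, "abs_max": 1000.0, "pitch_corr": 0.6}
noise = {"tilt": 0.1, "abs_max": 200.0, "pitch_corr": 0.2}
clean = weighted_parameters(raw, noise, gamma=0.5)
```

When the signal is far above the noise the ratio approaches zero and the parameters pass through almost unchanged; when the frame is mostly noise the ratio approaches its limit and most of the noise contribution is removed.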

In one embodiment, it may be desirable to estimate the evolution of the frame in
accordance with one or more of the previous parameters. The evolution is an
estimation over an interval of time (e.g., 8 times/frame) and is a linear approximation.

Evolution of the weighted tilt as the slope of the first order approximation, given
by:

$$\partial\kappa_w(k) = \frac{\sum_{l=1}^{7} l \cdot \left(\kappa_w(k-7+l) - \kappa_w(k-7)\right)}{\sum_{l=1}^{7} l^2}. \tag{15}$$

Evolution of the weighted maximum as the slope of the first order approximation,
given by:

$$\partial\chi_w(k) = \frac{\sum_{l=1}^{7} l \cdot \left(\chi_w(k-7+l) - \chi_w(k-7)\right)}{\sum_{l=1}^{7} l^2}. \tag{16}$$
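The evolution parameters are least-squares slopes of a line forced through the oldest of eight consecutive values. A sketch under that reading is given below; the function name and the list layout (oldest value first) are assumptions.

```python
def evolution_slope(history):
    """Slope of the first order approximation over eight sample points.

    history holds the weighted parameter at points k-7, ..., k (oldest
    first). The line is anchored at history[0], and the slope minimizes
    the squared error of l * slope against history[l] - history[0].
    """
    if len(history) != 8:
        raise ValueError("expected eight sample points")
    ref = history[0]
    num = sum(l * (history[l] - ref) for l in range(1, 8))
    den = sum(l * l for l in range(1, 8))  # equals 140
    return num / den
```

For a perfectly linear ramp the function returns the ramp's increment per sample point, and for a constant contour it returns zero.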



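In yet another embodiment, once the parameters of equations 6 through 16 are
updated for the exemplary eight sample points of the frame, the following frame-based
parameters may be calculated:

Maximum weighted pitch correlation (maximum of the frame), given by:

$$R_{w,p}^{\max} = \max\{\, R_{w,p}(k-7+l),\; l = 0, 1, \ldots, 7 \,\}. \tag{17}$$

Average weighted pitch correlation, given by:

$$R_{w,p}^{\mathrm{avg}} = \frac{1}{8} \sum_{l=0}^{7} R_{w,p}(k-7+l). \tag{18}$$

Running mean of average weighted pitch correlation, given by:

$$\langle R_{w,p}^{\mathrm{avg}}(m) \rangle = \alpha_2 \cdot \langle R_{w,p}^{\mathrm{avg}}(m-1) \rangle + (1 - \alpha_2) \cdot R_{w,p}^{\mathrm{avg}}, \tag{19}$$

where $m$ is the frame number and $\alpha_2 = 0.75$ is an exemplary adaptation constant.

Minimum weighted spectral tilt, given by:

$$\kappa_w^{\min} = \min\{\, \kappa_w(k-7+l),\; l = 0, 1, \ldots, 7 \,\}. \tag{20}$$

Running mean of minimum weighted spectral tilt, given by:

$$\langle \kappa_w^{\min}(m) \rangle = \alpha_2 \cdot \langle \kappa_w^{\min}(m-1) \rangle + (1 - \alpha_2) \cdot \kappa_w^{\min}. \tag{21}$$

Average weighted spectral tilt, given by:

$$\kappa_w^{\mathrm{avg}} = \frac{1}{8} \sum_{l=0}^{7} \kappa_w(k-7+l). \tag{22}$$

Minimum slope of weighted tilt (indicates the maximum evolution in the direction
of negative spectral tilt in the frame), given by:

$$\partial\kappa_w^{\min} = \min\{\, \partial\kappa_w(k-7+l),\; l = 0, 1, \ldots, 7 \,\}. \tag{23}$$

Accumulated slope of weighted spectral tilt (indicates the overall consistency of
the spectral evolution), given by:

$$\partial\kappa_w^{\mathrm{acc}} = \sum_{l=0}^{7} \partial\kappa_w(k-7+l). \tag{24}$$

Maximum slope of weighted maximum, given by:

$$\partial\chi_w^{\max} = \max\{\, \partial\chi_w(k-7+l),\; l = 0, 1, \ldots, 7 \,\}. \tag{25}$$

Accumulated slope of weighted maximum, given by:

$$\partial\chi_w^{\mathrm{acc}} = \sum_{l=0}^{7} \partial\chi_w(k-7+l). \tag{26}$$

In general, the parameters given by equations 23, 25 and 26 may be used to
mark whether a frame is likely to contain an onset (i.e., point where voiced speech
starts). The parameters given by equations 4 and 18-22 may be used to mark whether
a frame is likely to be dominated by voiced speech.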
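The frame-based parameters are simple reductions (maximum, minimum, mean, sum) over the eight per-frame sample points, plus a running mean across frames. The sketch below assumes each input is a list of eight values for the current frame; the dictionary keys and function names are hypothetical.

```python
ALPHA_2 = 0.75  # exemplary adaptation constant quoted in the text


def frame_parameters(pitch_corr_w, tilt_w, tilt_slope, max_slope):
    """Reduce eight per-point values of each quantity to frame parameters."""
    for values in (pitch_corr_w, tilt_w, tilt_slope, max_slope):
        if len(values) != 8:
            raise ValueError("expected eight sample points per frame")
    return {
        "pitch_corr_max": max(pitch_corr_w),
        "pitch_corr_avg": sum(pitch_corr_w) / 8.0,
        "tilt_min": min(tilt_w),
        "tilt_avg": sum(tilt_w) / 8.0,
        "tilt_slope_min": min(tilt_slope),
        "tilt_slope_acc": sum(tilt_slope),
        "max_slope_max": max(max_slope),
        "max_slope_acc": sum(max_slope),
    }


def running_mean(previous, value, alpha=ALPHA_2):
    """Frame-to-frame running mean with adaptation constant alpha."""
    return alpha * previous + (1.0 - alpha) * value
```

The slope-based entries would feed an onset indicator and the pitch correlation and tilt entries a voicing indicator, in line with the grouping of equations described above.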

Referring now to Figure 3, decision logic 208 is illustrated in block format
according to one embodiment of the present invention. Decision logic 208 is a module
designed to compare all the parameters with a set of thresholds. Any number of
desired parameters, illustrated generally as (1, 2, . . . k), may be compared in decision
logic 208. Typically, each parameter or a group of parameters will identify a particular
characteristic of the frame. For example, characteristic #1 302 may be speech vs.
non-speech detection. In one embodiment, the VAD may indicate exemplary characteristic
#1. If the VAD determines the frame is speech, the speech is typically further identified
as voiced (vowels) vs. unvoiced (e.g., "s"). Characteristic #2 304 may be, for example,
voiced vs. unvoiced speech detection. Any number of characteristics may be included
and may comprise one or more of the derived parameters. For example, generally
identified characteristic #M 306 may be onset detection and may comprise derived
parameters from equations 23, 25 and 26. Each characteristic may set a flag or the like
to indicate the characteristic has or has not been identified.

The final decision as to which class the frame belongs is preferably made in a
final decision module 308. All of the flags are received and compared with priority, e.g.,
the VAD as highest priority in module 308. In the present invention, the parameters are
derived from the speech itself and are free from the influence of background noise;
therefore, the thresholds are typically unaffected by changing background noise. In
general, a series of "if-then" statements may compare each flag or a group of flags. For
example, assuming each characteristic (flag) is represented by a parameter, in one
embodiment, an "if" statement may read: "if parameter 1 is less than a threshold, then
place in class X." In another embodiment, the statement may read: "if parameter 1 is
less than a threshold and parameter 2 is less than a threshold and so on, then place in
class X." In yet another embodiment, the statement may read: "if parameter 1 times
parameter 2 is less than a threshold, then place in class X." One skilled in the art can
readily recognize that any number of parameters either alone or in combination can be
included in an appropriate "if-then" statement. Of course, there may be equally effective
methods for comparing the parameters, all of which are intended to be included in the
scope of the invention.

Additionally, final decision module 308 may include an overhang. Overhang, as
used herein, shall have the meaning common in the industry. In general, overhang
means that the history of the signal class is considered, i.e., after certain signal classes
that same signal class is favored somewhat. For example, at a gradual transition from voiced to
unvoiced the voiced class is favored somewhat in order not to classify the segments
with a low degree of voiced speech as unvoiced too early.
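A prioritized chain of "if-then" comparisons with an overhang might be sketched as follows. Everything specific here is hypothetical: the text gives no threshold values, parameter selection, or overhang mechanism, so the thresholds, the parameter names, and the choice to implement overhang by relaxing the voicing threshold after a voiced or onset frame are assumptions made only to show the structure.

```python
# Hypothetical thresholds for illustration only; the text gives no values.
THRESHOLDS = {
    "pitch_corr_voiced": 0.5,
    "tilt_voiced": 0.2,
    "max_slope_onset": 50.0,
}


def classify(is_speech, params, prev_class, thresholds=THRESHOLDS,
             overhang_margin=0.1):
    """Return 'silence', 'unvoiced', 'onset' or 'voiced' for one frame."""
    # The VAD flag has the highest priority.
    if not is_speech:
        return "silence"
    # Onset: a sharp rise of the amplitude envelope after silence/unvoiced.
    if (params["max_slope_max"] > thresholds["max_slope_onset"]
            and prev_class in ("silence", "unvoiced")):
        return "onset"
    # Overhang: after onset/voiced frames, favor the voiced class somewhat.
    corr_threshold = thresholds["pitch_corr_voiced"]
    if prev_class in ("onset", "voiced"):
        corr_threshold -= overhang_margin
    if (params["pitch_corr_avg"] > corr_threshold
            and params["tilt_avg"] > thresholds["tilt_voiced"]):
        return "voiced"
    return "unvoiced"
```

Because the compared parameters are the noise-compensated ones, a single fixed table of thresholds is used here regardless of the background noise level, which is the design point the text emphasizes.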

By way of demonstration, a brief description of some exemplary classes will
follow. It should be appreciated that the present invention may be used to classify
speech into any number or combination of classes and the following description is
included merely to introduce the reader to one possible set of classes.

The exemplary eX-CELP algorithm classifies the frame into one of 6 classes
according to dominating features of the frame. The classes are labeled:

0. Silence/Background Noise
1. Noise-Like Unvoiced Speech
2. Unvoiced
3. Onset
4. Plosive, not used
5. Non-Stationary Voiced
6. Stationary Voiced

In the illustrated embodiment, class 4 is not used, thus the number of classes is
6. In order to effectively make use of the information available in the encoder, the
classification module may be configured so that it does not initially distinguish between
classes 5 and 6. This distinction is instead made in another module outside of the
classifier where additional information may be available. Furthermore, the classification
module may not initially detect class 1; instead, class 1 may be introduced in another module
based on additional information and the detection of noise-like unvoiced speech.
Hence, in one embodiment, the classification module may distinguish between
silence/background noise, unvoiced, onset, and voiced using class numbers 0, 2, 3 and 5,
respectively.
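The class numbering can be captured in a small enumeration. The sketch below mirrors the list above; the member names, the tuples, and the `refine_voiced` helper (standing in for the later module that separates classes 5 and 6) are hypothetical, and the stationarity test itself is not described in the text.

```python
from enum import IntEnum


class SpeechClass(IntEnum):
    SILENCE_BACKGROUND_NOISE = 0
    NOISE_LIKE_UNVOICED = 1
    UNVOICED = 2
    ONSET = 3
    PLOSIVE = 4  # listed but not used in the illustrated embodiment
    NON_STATIONARY_VOICED = 5
    STATIONARY_VOICED = 6


# The six classes actually in use.
USED_CLASSES = tuple(c for c in SpeechClass if c is not SpeechClass.PLOSIVE)

# Classes the classification module itself distinguishes initially.
INITIAL_CLASSES = (
    SpeechClass.SILENCE_BACKGROUND_NOISE,
    SpeechClass.UNVOICED,
    SpeechClass.ONSET,
    SpeechClass.NON_STATIONARY_VOICED,
)


def refine_voiced(initial, is_stationary):
    """Hypothetical later-stage split of class 5 into class 5 or 6."""
    if initial is SpeechClass.NON_STATIONARY_VOICED and is_stationary:
        return SpeechClass.STATIONARY_VOICED
    return initial
```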

Referring now to Figure 4, an exemplary module flow chart is illustrated in
accordance with one embodiment of the present invention. The exemplary flow chart
may be implemented using C code or any other suitable computer language known in
the art. In general, the steps illustrated in Figure 4 are similar to the foregoing
disclosure.

A digitized speech signal is input to an encoder for processing and compression
into the bitstream, or a bitstream into a decoder for reconstruction (step 400). The
signal (usually frame by frame) may originate, for example, from a cellular phone
(wireless), the Internet (voice over IP), or a telephone (PSTN). The present system is
especially suited for low bit rate applications (4 kbits/s), but may be used for other bit
rates as well.

The encoder may include several modules which perform different functions. For
example, a VAD may indicate whether the input signal is speech or non-speech (step
405). Non-speech typically includes background noise, music and silence. Non-speech,
such as background noise, is stationary and remains stationary. Speech, on the other
hand, has pitch and thus the pitch correlation varies between sounds. For example, an
"s" has very low pitch correlation, but an "a" has high pitch correlation. While Figure 4
illustrates a VAD, it should be appreciated that in particular embodiments a VAD is not
required. Some parameters could be derived prior to removing the noise component,
and based on those parameters it is possible to estimate whether the frame is
background noise or speech. The basic parameters are derived (step 415); however, it
should be appreciated that some of the parameters used for encoding may be
calculated in different modules within the encoder. To avoid redundancy, those
parameters are not recalculated in step 415 (or subsequent steps 425, 430) but may
be used to derive further parameters or just passed on to classification. Any number of
basic parameters may be derived during this step; however, by way of example,
previously disclosed equations 1-5 are suitable.

The information from the VAD (or its equivalent) indicates whether the frame is 
flp speech or non-speech. If the frame is non-speech, the noise parameters (e.g., the 
01 mean of the noise parameters) may be updated (step 410). Many variations of 
W equations for the parameters of step 410 may be derived, however, by way of example, 
I *) previously disclosed equations 6-1 1 are suitable. The present invention discloses a 
JU { method for classifying which estimates the parameters of clean speech. This is 
?|j5 advantageous, for among other reasons, because the ever-changing background noise 
n will not significantly affect the optimal thresholds. The noise-free set of parameters is 
obtained by, for example, estimating and removing the noise component of the 
parameters (step 425). Again by way of example, previously disclosed equations 12-14 
are suitable. Based upon the previous steps, additional parameters may or may not be 
20 derived (step 430). Many variations of additional parameters may be included for 
consideration, but by way of example, previously disclosed equations 15-26 are 
suitable. 

Once the desired parameters are derived, the parameters are compared against 
a set of predetermined thresholds (step 435). The parameters may be compared 
individually or in combination with other parameters. There are many conceivable 
methods for comparing the parameters; however, the previously disclosed series of 
"if-then" statements is suitable. 
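
A minimal sketch of such a comparison (step 435) is given here in Python, assuming 
three hypothetical noise-free parameters (`pitch_corr`, `energy`, `tilt`) and 
placeholder threshold values. Neither the parameter names nor the values are taken 
from the previously disclosed "if-then" statements. 

```python
# Hypothetical thresholds; the values are placeholders, not taken from the disclosure.
THRESHOLDS = {"pitch_corr": 0.5, "energy": 10.0, "tilt": 0.3}


def compare_with_thresholds(params, thresholds=THRESHOLDS):
    """Step 435 (sketch): turn noise-free parameters into boolean flags."""
    flags = {}
    # A parameter compared individually against its threshold.
    flags["low_tilt"] = params["tilt"] < thresholds["tilt"]
    # Two parameters compared in combination.
    flags["voiced_like"] = (
        params["pitch_corr"] > thresholds["pitch_corr"]
        and params["energy"] > thresholds["energy"]
    )
    return flags


flags = compare_with_thresholds({"pitch_corr": 0.8, "energy": 19.24, "tilt": 0.1})
```

Since the thresholds are applied to noise-free parameters, the same fixed table can, 
under this assumption, be used regardless of the background noise level. 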

It may be desirable to apply an overhang (step 440). This simply allows the 
classifier to favor certain classes based on knowledge of the history of the signal. 
It thereby becomes possible to take advantage of the knowledge of how speech signals 
evolve over a slightly longer term. The frame is now ready to be classified (step 445) into 
one of many different classes depending upon the application. By way of example, the 
previously disclosed classes (0-6) are suitable, but are in no way intended to limit the 
invention's applications. 
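
One possible form of overhang (step 440) is sketched here in Python. The sketch assumes 
that a higher class number denotes a more speech-like frame and that the previous 
class is held for at most `max_hold` frames; both assumptions, and all names, are 
hypothetical and are not taken from the disclosure. 

```python
from typing import List, Tuple


def apply_overhang(raw_class: int, previous_class: int, frames_held: int,
                   max_hold: int = 2) -> Tuple[int, int]:
    """Favor the previous class for up to max_hold frames when the raw class drops.

    Returns the (possibly adjusted) class and the updated hold counter.
    """
    if raw_class < previous_class and frames_held < max_hold:
        return previous_class, frames_held + 1
    return raw_class, 0


def classify_sequence(raw_classes: List[int]) -> List[int]:
    """Apply the overhang frame by frame, starting from class 0."""
    previous, held = 0, 0
    smoothed = []
    for raw in raw_classes:
        previous, held = apply_overhang(raw, previous, held)
        smoothed.append(previous)
    return smoothed


smoothed = classify_sequence([6, 2, 2, 2, 6])  # -> [6, 6, 6, 2, 6]
```

In this example the drop from class 6 to class 2 is delayed by two frames, which 
illustrates how the history of the signal can bias the final decision. 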

The information from the classified frame can be used to further process the 
speech (step 450). In one embodiment, the classification is used to apply weighting to 
the frame (e.g., step 450) and in another embodiment, the classification is used to 
determine the bit rate (not shown). For example, it is often desirable to maintain the 
periodicity of voiced speech (step 460), but maintain the randomness (step 465) of 
noise and unvoiced speech (step 455). Many other uses for the class information will 
become apparent to those skilled in the art. Once all the processes have been 
completed within the encoder, the encoder's function is over (step 470) and the bits 
representing the signal frame may be transmitted to a decoder for reconstruction. 
Alternatively, the foregoing classification process may be performed at the decoder 
based on the decoded parameters and/or on the reconstructed signal. 

The present invention is described herein in terms of functional block 
components and various processing steps. It should be appreciated that such 
functional blocks may be realized by any number of hardware components configured to 
perform the specified functions. For example, the present invention may employ 
various integrated circuit components, e.g., memory elements, digital signal processing 
elements, logic elements, look-up tables, and the like, which may carry out a variety of 
functions under the control of one or more microprocessors or other control devices. In 
addition, those skilled in the art will appreciate that the present invention may be 
practiced in conjunction with any number of data transmission protocols and that the 
system described herein is merely an exemplary application for the invention. 

It should be appreciated that the particular implementations shown and described 
herein are illustrative of the invention and its best mode and are not intended to limit the 
scope of the present invention in any way. Indeed, for the sake of brevity, conventional 
techniques for signal processing, data transmission, signaling, and network control, and 
other functional aspects of the systems (and components of the individual operating 
components of the systems) may not be described in detail herein. Furthermore, the 
connecting lines shown in the various figures contained herein are intended to represent 
exemplary functional relationships and/or physical couplings between the various 
elements. It should be noted that many alternative or additional functional relationships 
or physical connections may be present in a practical communication system. 

The present invention has been described above with reference to preferred 
embodiments. However, those skilled in the art having read this disclosure will 
recognize that changes and modifications may be made to the preferred embodiments 
without departing from the scope of the present invention. For example, similar forms 
may be added without departing from the spirit of the present invention. These and 
other changes or modifications are intended to be included within the scope of the 
present invention, as expressed in the following claims. 

Abstract 

A method for robust speech classification in speech coding and, in particular, for 
robust classification in the presence of background noise is herein provided. A 
noise-free set of parameters is derived, thereby reducing the adverse effects of 
background noise on the classification process. The speech signal is identified as 
speech or non-speech. A set of basic parameters is derived for the speech frame, then 
the noise component of the parameters is estimated and removed. If the frame is 
non-speech, the noise estimations are updated. All the parameters are then compared 
against a predetermined set of thresholds. Because the background noise has been 
removed from the parameters, the set of thresholds is largely unaffected by any changes 
in the noise. The frame is classified into any number of classes, thereby emphasizing 
the perceptually important features by performing perceptual matching rather than 
waveform matching. 

Claims 

1. A method for obtaining a set of parameters used for classification comprising the 
steps of: 

(a) receiving a signal at a processing unit; 

(b) providing at least one basic parameter corresponding to the signal; 

(c) if present, estimating a noise component of the parameter; and 

(d) if present, removing the noise component from the parameter. 

2. The method of claim 1 further comprising the step of determining whether the 
signal is speech or non-speech. 

3. The method of claim 1 further comprising the step of providing at least one 
additional parameter. 

4. The method of claim 3 wherein the noise component is present and the step of 
providing at least one additional parameter is in response to the noise component. 

5. The method of claim 2 further comprising the step of updating the noise 
parameters if the signal is non-speech. 

6. The method of claim 1 wherein the step of providing comprises deriving at least 
one basic parameter corresponding to the signal. 

7. The method of claim 1 wherein the step of providing comprises receiving at least 
one basic parameter corresponding to the signal. 

8. A method for classifying speech comprising the steps of: 

(a) receiving a speech-related signal at a processing unit; 

(b) providing at least one parameter to be used for classifying the signal; 

(c) estimating a noise component of the parameter; 

(d) removing the noise component from the parameter; 

(e) comparing the parameter with a set of at least one threshold; and 

(f) associating the signal with a class in response to the comparing step. 

9. The method of claim 8 further comprising the step of determining whether the 
signal is speech or non-speech. 

10. The method of claim 9 further comprising the step of updating a noise component 
if the signal is non-speech. 

11. The method of claim 8 wherein at least one parameter is derived to classify the 
signal. 

12. The method of claim 11 wherein a set of basic parameters is derived and at least 
one noise component parameter. 

13. The method of claim 8 wherein said comparing step comprises: 

(a) identifying at least one characteristic of the signal with at least one of the 
parameters; 

(b) setting a flag to indicate the characteristic is present; 

(c) receiving at least one flag in a final decision module; and 

(d) associating a class with at least one flag. 

14. The method of claim 8 wherein at least one parameter is received to classify the 
signal. 

15. A method for perceptually matching a speech signal in a speech coding device 
having at least one process module, the method comprising the steps of: 

(a) receiving the signal at the speech coding device; 

(b) deriving a plurality of signal parameters in the process module; 

(c) weighting the parameters; 

(d) associating a particular signal characteristic with the signal parameters; 

(e) setting a flag in the process module when the characteristic is identified; 

(f) comparing the flags; and 

(g) classifying the signal according to one of the comparing step or the deriving step. 

16. The method of claim 15 wherein said deriving step comprises deriving a set of 
basic parameters and deriving a set of noise-related parameters. 

17. The method of claim 15 wherein said weighting step comprises: 

(a) estimating a noise component of the parameter in the process modules; and 

(b) removing the noise component of the parameter in the process module. 

18. The method of claim 17 wherein said weighting step comprises a set of noise 
estimation equations. 

19. A method for speech coding whereby a set of homogeneous parameters is 
provided for classifying a signal, the set of parameters being uninfluenced by a 
background noise. 

20. A method for speech communication whereby influence from speech-related 
noise is reduced, the method comprising the steps of: 

(a) receiving a digital speech-related signal at a speech processing device; 

(b) forming a set of homogenous parameters; 

(c) comparing the parameters with a threshold; and 

(d) classifying the signal. 

21. The method of claim 20, wherein the forming step comprises forming a set of 
"noise-free" parameters. 

22. The method of claim 21, wherein the forming step comprises: 

(b1) estimating a noise component; and 

(b2) removing the noise component. 

23. The method of claim 20, wherein the comparing step is with a set of thresholds. 